Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalidate PackedInstruction cache upon modification #13543

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

raynelfss
Copy link
Contributor

@raynelfss raynelfss commented Dec 10, 2024

Summary

This should supercede #13532, and should correctly fix #13504
A previous mis-implementation allowed shared references between cloned instances of PackedInstruction when it came to their Python counterparts. The following commits re-implement the Clone trait and makes PackedInstruction attributes accessible only through views. This is done to allow the cached PyGate to be discarded upon modification of op, params or extra_attrs attributes of said instance of PackedInstruction.

Details and comments

As referenced by @Cryoris in #13504, we were finding unknown parameters when performing translations in parallel. This was due to modifications coming from BasisTranslator helper function compose_transforms which would modify certain parameters from a circuit. The reason for this was that we had been unintentionally copying references to the cached python gates stored within these operation instances.

The following commits perform a couple of changes on the PackedInstruction class to prevent this from happening:

  • Implement the Clone trait for PackedInstruction such that the cached py_gate reference is discarded.
  • Control any modifications of the cached_pygate by discarding the cache whenever these attributes are borrowed mutably. This was done by:
    • Making all attributes of the PackedInstruction private.
    • Create getters and setters (more accurately mutable getters) that control when it is proper to discard the cached gate.

These changes had a propagated effect in the codebase and a lot of other pieces of code had to be modified for it to work with it.

Notes:

  • There are a couple of performance regressions related to circuit construction time. This is a tradeoff from having to rebuild the cached gate many more times than it was required before.

Co-authored-by: Julien Gacon gaconju@gmail.com

- Make all attributes of the `PackedInstruction` private to allow for control of the underlying cached `PyObject` reference stored in Python. This ensures that no dead references are left as soon as the gate is modified from the Rust side.
- Reimplement `Clone` trait for `PackedInstruction`s to clear the cached gate whenever we decide make calls to `clone()`.
- Re-organize rest of the code to adapt to the new changes.
@raynelfss raynelfss changed the title [WIP] Invalidate PackedInstruction cached upon modification [WIP] Invalidate PackedInstruction cache upon modification Dec 10, 2024
@Cryoris
Copy link
Contributor

Cryoris commented Dec 10, 2024

While I'm not sure how to test the original bug in CI (since we turn Qiskit in parallel off) we can test @jakelishman's reproducer from: #13532 (comment) 🙂

@coveralls
Copy link

coveralls commented Dec 10, 2024

Pull Request Test Coverage Report for Build 12280844236

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 628 of 644 (97.52%) changed or added relevant lines in 24 files are covered.
  • 18 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.03%) to 88.966%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/commutation_cancellation.rs 6 7 85.71%
crates/circuit/src/packed_instruction.rs 84 87 96.55%
crates/accelerate/src/basis/basis_translator/mod.rs 76 82 92.68%
crates/circuit/src/dag_circuit.rs 209 215 97.21%
Files with Coverage Reduction New Missed Lines %
crates/accelerate/src/unitary_synthesis.rs 1 92.14%
crates/accelerate/src/basis/basis_translator/compose_transforms.rs 1 97.44%
crates/circuit/src/interner.rs 3 88.52%
crates/qasm2/src/lex.rs 3 92.23%
crates/accelerate/src/basis/basis_translator/mod.rs 4 89.42%
crates/circuit/src/circuit_data.rs 6 94.82%
Totals Coverage Status
Change from base Build 12236482718: 0.03%
Covered Lines: 79388
Relevant Lines: 89234

💛 - Coveralls

@raynelfss
Copy link
Contributor Author

These changes do fix things but there is a big performance regression in circuit construction. Since I removed a lot of the pre-caching that was being done prior to introducing these changes, I thought adding those back would fix the performance regression, however, local testing proved that including those made things even slower. I will post those numbers later, but slowdowns were as high as 2x slower. This is the current regression:


Change Before [841cb29] After [b337cf3] Ratio Benchmark (Parameter)
+ 60.9±2μs 93.0±4μs 1.53 circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8192)
+ 215±4μs 315±6μs 1.47 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 32768)
+ 216±9μs 314±9μs 1.46 circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 32768)
+ 838±30μs 1.21±0.01ms 1.44 circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 131072)
+ 60.9±1μs 87.3±2μs 1.43 circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8192)
+ 223±10μs 315±10μs 1.41 circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 32768)
+ 215±6μs 301±8μs 1.4 circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 32768)
+ 61.4±1μs 85.5±7μs 1.39 circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8192)
+ 1.32±0.02ms 1.80±0.06ms 1.37 circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 131072)
+ 220±6μs 300±8μs 1.36 circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 32768)
+ 19.3±0.4μs 26.1±0.5μs 1.35 circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 2048)
+ 223±6μs 301±10μs 1.35 circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 32768)
+ 20.6±0.4μs 27.9±1μs 1.35 circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 2048)
+ 22.1±0.9μs 29.8±1μs 1.35 circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 2048)
+ 60.0±3μs 79.3±4μs 1.32 circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8192)
+ 1.33±0.02ms 1.75±0.09ms 1.32 circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 131072)
+ 61.0±2μs 80.4±4μs 1.32 circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8192)
+ 64.7±1μs 84.5±0.9μs 1.31 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8192)
+ 24.0±0.8μs 31.0±2μs 1.29 circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 2048)
+ 20.9±0.6μs 27.0±0.7μs 1.29 circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 2048)
+ 1.44±0.08ms 1.83±0.07ms 1.27 circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 131072)
+ 1.34±0.02ms 1.67±0.03ms 1.25 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 131072)
+ 26.0±0.5μs 32.0±0.7μs 1.23 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 2048)
+ 1.37±0.06ms 1.68±0.1ms 1.23 circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 131072)
+ 5.92±0.07ms 6.90±0.3ms 1.16 circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(100, 10)
+ 7.73±0.06ms 8.56±0.2ms 1.11 circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(50, 50)
+ 6.34±0.09μs 6.98±0.1μs 1.1 circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 128)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

With this comes the question of whether we should continue to explore ways of optimizing this and, were we not to find a way of optimizing, should we choose having a more stable solution in exchange of a potentially slower circuit builder?

@raynelfss
Copy link
Contributor Author

While I'm not sure how to test the original bug in CI (since we turn Qiskit in parallel off) we can test @jakelishman's reproducer from: #13532 (comment) 🙂

It seems to work well with this based on local tests.

@kevinhartman kevinhartman added this to the 1.3.1 milestone Dec 10, 2024
@raynelfss
Copy link
Contributor Author

Other than the circuit instruction benchmarks, not much seems to have changed in the utility scale, although a couple benchmarks seem to have slowed down.


Utility scale benchmarks:

Change Before [841cb29] After [b337cf3] Ratio Benchmark (Parameter)
0 0 n/a utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('cx')
0 0 n/a utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('cz')
0 0 n/a utility_scale.UtilityScaleBenchmarks.track_bvlike_depth('ecr')
328±8ms 343±10ms 1.05 utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')
345±8ms 357±4ms 1.03 utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')
500±10ms 513±10ms 1.03 utility_scale.UtilityScaleBenchmarks.time_qft('cx')
7.12±0.1ms 7.24±0.2ms 1.02 utility_scale.UtilityScaleBenchmarks.time_bvlike('cz')
2.77±0.01s 2.82±0.02s 1.02 utility_scale.UtilityScaleBenchmarks.time_circSU2('cz')
2.77±0.04s 2.79±0.05s 1.01 utility_scale.UtilityScaleBenchmarks.time_circSU2('ecr')
4.86±0.2ms 4.93±0.1ms 1.01 utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('ecr')
51.7±0.8ms 52.1±0.9ms 1.01 utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cz')
51.5±1ms 52.2±0.6ms 1.01 utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('ecr')
16.5±0.4ms 16.6±0.1ms 1.01 utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cx')
279±3ms 283±4ms 1.01 utility_scale.UtilityScaleBenchmarks.time_qaoa('cx')
608±20ms 614±20ms 1.01 utility_scale.UtilityScaleBenchmarks.time_qft('cz')
103±1ms 103±2ms 1.00 utility_scale.UtilityScaleBenchmarks.time_bv_100('cz')
2.79±0.03s 2.78±0.01s 1.00 utility_scale.UtilityScaleBenchmarks.time_circSU2('cx')
51.6±0.8ms 51.6±1ms 1.00 utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cx')
16.6±0.5ms 16.6±0.4ms 1.00 utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cz')
16.4±0.4ms 16.5±0.4ms 1.00 utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('ecr')
661±20ms 659±10ms 1.00 utility_scale.UtilityScaleBenchmarks.time_qv('cx')
733±20ms 737±20ms 1.00 utility_scale.UtilityScaleBenchmarks.time_qv('ecr')
390 390 1.00 utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cx')
397 397 1.00 utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cz')
397 397 1.00 utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('ecr')
300 300 1.00 utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cx')
300 300 1.00 utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cz')
300 300 1.00 utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('ecr')
1607 1607 1.00 utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cx')
1622 1622 1.00 utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cz')
1622 1622 1.00 utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('ecr')
1954 1954 1.00 utility_scale.UtilityScaleBenchmarks.track_qft_depth('cx')
1954 1954 1.00 utility_scale.UtilityScaleBenchmarks.track_qft_depth('cz')
1954 1954 1.00 utility_scale.UtilityScaleBenchmarks.track_qft_depth('ecr')
2709 2709 1.00 utility_scale.UtilityScaleBenchmarks.track_qv_depth('cx')
2709 2709 1.00 utility_scale.UtilityScaleBenchmarks.track_qv_depth('cz')
2709 2709 1.00 utility_scale.UtilityScaleBenchmarks.track_qv_depth('ecr')
462 462 1.00 utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cx')
462 462 1.00 utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cz')
462 462 1.00 utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('ecr')
98.2±3ms 97.7±1ms 0.99 utility_scale.UtilityScaleBenchmarks.time_bv_100('cx')
103±1ms 102±2ms 0.99 utility_scale.UtilityScaleBenchmarks.time_bv_100('ecr')
7.20±0.2ms 7.13±0.2ms 0.99 utility_scale.UtilityScaleBenchmarks.time_bvlike('cx')
4.90±0.1ms 4.87±0.08ms 0.99 utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cz')
125±3ms 124±1ms 0.99 utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')
124±3ms 123±2ms 0.99 utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr')
5.00±0.04ms 4.89±0.07ms 0.98 utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cx')
611±20ms 602±10ms 0.98 utility_scale.UtilityScaleBenchmarks.time_qft('ecr')
760±20ms 739±3ms 0.97 utility_scale.UtilityScaleBenchmarks.time_qv('cz')
7.48±0.1ms 7.19±0.2ms 0.96 utility_scale.UtilityScaleBenchmarks.time_bvlike('ecr')
115±3ms 110±2ms 0.96 utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Transpiler levels

Change Before [841cb29] After [b337cf3] <rebalance-packed-instructions~1> Ratio Benchmark (Parameter)
+ 40.3±2ms 49.5±0.2ms 1.23 transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)
+ 40.5±1ms 49.2±0.2ms 1.21 transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)
- 12.3±0.2ms 11.1±0.2ms 0.9 transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.


- Add views for `py_op`.
- Remove stale saching being done in `apply_operation_*` in `DAGCircuit`.
- Adapt the code to new changes.
- Add missing docstring
@raynelfss
Copy link
Contributor Author

There seems to be a strange effect in which some benchmarks improve while others get worse, which is strange:

Change Before [841cb29] After [590b4a3a] Ratio Benchmark (Parameter)
+ 16.7±0.5ms 22.4±0.3ms 1.35 transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')
+ 38.5±0.2ms 50.2±0.5ms 1.3 transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)
+ 40.9±0.9ms 49.6±0.2ms 1.21 transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)
- 24.0±0.5ms 16.5±0.4ms 0.69 transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')
- 23.3±0.2ms 15.7±0.1ms 0.67 transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'dense')
- 24.4±0.5ms 15.8±0.2ms 0.65 transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@raynelfss raynelfss added bug Something isn't working Changelog: Bugfix Include in the "Fixed" section of the changelog mod: circuit Related to the core of the `QuantumCircuit` class or the circuit library priority: high labels Dec 10, 2024
@raynelfss raynelfss marked this pull request as ready for review December 10, 2024 19:23
@raynelfss raynelfss requested a review from a team as a code owner December 10, 2024 19:23
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

@raynelfss raynelfss changed the title [WIP] Invalidate PackedInstruction cache upon modification Invalidate PackedInstruction cache upon modification Dec 10, 2024
@raynelfss raynelfss added the stable backport potential The bug might be minimal and/or import enough to be port to stable label Dec 10, 2024
Copy link
Contributor

@kevinhartman kevinhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far, but I've noted a few concerns inline.

Specifically, it isn't entirely clear to me when we can preserve the py_op, if at all. In its current state, it looks like this PR always throws it out, even when building new instances of PackedInstruction via ::new, and even when cloning a singleton gate. If we're doing so unnecessarily, that is certainly performance we're leaving (/putting back) on the table. Do you know if that's the case @raynelfss?

It'd be great if @jakelishman or @mtreinish can confirm prior to the potential inclusion of this PR in 1.3.1 this week, though they're both on holiday.

#[cfg(feature = "cache_pygates")]
{
inst.py_op = py_op.unbind().into();
*inst.py_op_mut() = py_op.unbind().into();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we consider constructing a new PackedInstruction with a user-provided Python op for the cache as unsafe, then we should also consider this to be unsafe, correct? Since the instance we get by calling func is something the user may have a reference to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your reasoning, I'm okay with making the py_op attribute fully invisible since it is still unsafe (even when immutably borrowed).

params: (!py_op.params.is_empty()).then(|| Box::new(py_op.params)),
extra_attrs: py_op.extra_attrs,
#[cfg(feature = "cache_pygates")]
py_op: op.unbind().into(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need to throw away the py_op when creating new PackedInstruction instances?

Surely we anticipated that it would be an issue for Python-side code to modify the cached objects at the time we introduced the py_op cache. That makes me think we should preserve py_op as part of PackedOperation::new. Isn't the issue only via the default Clone for PackedInstruction?

I realize that @jakelishman is on holiday, but perhaps if he sees this he can clarify.

Copy link
Contributor Author

@raynelfss raynelfss Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but also think of it this way.

If you look at the PackedInstruction::unpack_py_op() method, in which a reference is built to a python object (basically what we do when the cache_pygate flag is disabled), in the case of a standard gate, we build a new instance of py_op based on the current parameters or the extra attributes of our gate, and not only that, but the instance we get from this function whenever dealing with any other type of gate comes directly from the OperationRef type if it contains a python object within and isn't really affected by changes unless our op type is StandardGate.

Why do I mention this? Well, whenever a gate is initialized without a cached pygate, this is the function in charge of building that gate for us. Since this gate is disconnected and changes won't propagate, we'd need to update it by creating a new one whenever our params or attributes are modified (or at least that's what I understand from here).

We could technically optimize further by only re-building if we have an instance of Standard gate, since what we do in other cases is a call to clone_ref(py). What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python doesn't enforce ownership in the way Rust does, but we can still think of it in similar terms. DAGCircuit.apply_operation_back has always behaved as if calling it transferred ownership of the gate object, as opposed to QuantumCircuit.append, which copied the object if it detected that you might want to modify it (i.e. it was parametric).

Since the Python object coming in is logically "owned" by the function now, I think it's fine to put it in the cache.

With all the Python stuff, we can't compile-time (or even, in most cases, run-time) forbid modifications to Python objects, so there's always going to be the possibility for a user to break us a bit, and some of our Python-based will be pragmatics. I think in this case, saving the incoming object in the cache is no more dangerous than having the cache at all, so it should be fine to keep.

// Setters

/// Retrieves an immutable reference to the instruction's underlying operation.
pub fn op_mut(&mut self) -> &mut PackedOperation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me that all of these _mut methods are only used within the qiskit-circuit crate. Should we introduce these with visibility pub(crate) instead?

Originally PackedInstruction was more of a Rust POD, but now it feels like we should lock down its internals where possible since it is managing its own cache invalidation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this makes more sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making the methods pub(crate) is a bit needlessly restrictive. Anywhere you can get a &mut PackedInstruction, you can effectively change any fields you need to, because you can *instruction = my_new_packed_instruction, so the actual external access restriction for the circuit needs to be purely managed by when you can get a &mut PackedInstruction.

Otherwise, it's just needlessly restrictive - if you can create a new PackedInstruction with pub visibility, then making these methods pub(crate) doesn't actually restrict any capability.

crates/circuit/src/packed_instruction.rs Outdated Show resolved Hide resolved
crates/circuit/src/packed_instruction.rs Outdated Show resolved Hide resolved
crates/circuit/src/packed_instruction.rs Outdated Show resolved Hide resolved
crates/circuit/src/packed_instruction.rs Outdated Show resolved Hide resolved
crates/circuit/src/packed_instruction.rs Outdated Show resolved Hide resolved
params: self.params.clone(),
extra_attrs: self.extra_attrs.clone(),
#[cfg(feature = "cache_pygates")]
py_op: OnceLock::new(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Toward the beginning of the discussion here, @jakelishman says:

the right solution is to manually impl Clone for PackedInstruction to not propagate the cache for gates that aren't Python-space singletons.

Should we be preserving the py_op in the case of singleton gates to avoid a more significant performance regression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we be able to identify which ones are singletons?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking mostly about what's necessary, rather than what's easy. Technically it's safe to propagate the cache if the Python object itself is immutable. If it's too hard to tell, or we deem it too risky, we can just zero them all out.

Fwiw, we'd basically just apply a Rust-only heuristic: check the PackedOperation is a Standard and it's a 0-parameter gate with extra_attrs == None. That won't catch every single singleton, but it'll do the vast majority and fails safe.

@raynelfss
Copy link
Contributor Author

@kevinhartman I'd honestly be okay with not keeping an initialized py_op in any case whatsoever. We can have it be a cached instance that's only built when needed and discard it anytime we modify a parameter, op, or extra_attrs. I feel this makes much more sense as it would produce more stable results. I wonder how the rest of the team feels about this.

As for what happens if we bring back the deleted assigned py_op instances, the performance is actually way worse (though I'm not quite sure why that is). I made a divergent branch not too long ago in which I added py_op as an argument for PackedInstruction::new() and it resulted in an even greater performance regression during circuit construction:


Change Before [84def52] After [d692313] Ratio Benchmark (Parameter)
+ 121±10μs 187±40μs 1.54 circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8192)
+ 110±2μs 138±10μs 1.25 circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8192)
+ 373±20μs 454±20μs 1.22 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 32768)
+ 114±3μs 133±0.6μs 1.17 circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8192)
+ 112±4μs 131±3μs 1.17 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8192)
+ 33.8±2μs 38.8±0.6μs 1.15 circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 2048)
+ 37.4±1μs 43.0±2μs 1.15 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 2048)
+ 3.19±0.05ms 3.56±0.03ms 1.12 circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 131072)
+ 3.22±0.01ms 3.57±0.03ms 1.11 circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 131072)
+ 3.15±0.03ms 3.50±0.01ms 1.11 circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 131072)
+ 3.21±0.05ms 3.54±0.03ms 1.1 circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 131072)
+ 3.18±0.03ms 3.51±0.04ms 1.1 circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 131072)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.


It's very puzzling as to why this is happening, but if we could make the attribute fully private to the PackedInstruction instance and have it only generate it when needed, it would make a lot more sense to keep this around.

@jakelishman
Copy link
Member

I'm very surprised to see performance dropping if some of the caches are being propagated but not all - it might be a good idea to try and dig heavily into that and really understand why it's happened, especially since this is being considered for a bugfix.

Fwiw, this feels like a very major change to be targetted at a bugfix to me, especially a release that might go out tomorrow. It might worth considering squashing the BasisTranslator problem in a less ideal way for the bugfix release (like just having assign_parameters in Rust space clear the pyop cache rather than attempting to mutate the Python object) that we know is safe and doesn't have the same not-quite-understood performance drawbacks. But of course, I'm still on leave and don't have anywhere near the full context for anything going on here, so you guys should actually make the decision, not me.

@kevinhartman
Copy link
Contributor

Fwiw, this feels like a very major change to be targetted at a bugfix to me, especially a release that might go out tomorrow. It might worth considering squashing the BasisTranslator problem in a less ideal way for the bugfix release (like just having assign_parameters in Rust space clear the pyop cache rather than attempting to mutate the Python object) that we know is safe and doesn't have the same not-quite-understood performance drawbacks. But of course, I'm still on leave and don't have anywhere near the full context for anything going on here, so you guys should actually make the decision, not me.

We're all in agreement that this PR should not merge until after 1.3.1. @raynelfss is working on a separate more localized hotfix for #13504 for this release, and we'll revisit this one in the future when we have time to understand the performance characteristics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Changelog: Bugfix Include in the "Fixed" section of the changelog mod: circuit Related to the core of the `QuantumCircuit` class or the circuit library on hold Can not fix yet priority: high stable backport potential The bug might be minimal and/or import enough to be port to stable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Basis translation can fail if run in parallel
7 participants